Assignment 2 - ENVS563
Exploring Various Datasets on Bay Area and Finding Useful Insights for Policymakers.
Student ID : 201763050
Module Leader: Dr. Elisabetta Pietrostefani
Department of Geography and Planning
School of Environmental Sciences
University of Liverpool
This Computational Essay is submitted as a part of assessment in the module of
Geographic Data Science ENVS563
December 2023
Installing Packages
The code below checks whether the packages listed in the variable (packages_required) are installed. Any missing packages are installed automatically, and all required packages are then loaded for the analysis.
packages_required <- c("tidycensus", "tigris","tidyverse", "data.table","sf", "tmap", "tmaptools","readr", "geojsonsf", "osmdata","devtools", "RColorBrewer","classInt","R.utils","dplyr", "ggplot2", "viridis","raster","terra","exactextractr", "tidyterra", "tibble", "rosm", "spdep", "cluster", "GGally", "rgeoda", "tidyr", "gridExtra","patchwork")
# Include any additional packages needed in the variable above.
new_req_packages <- packages_required[!(packages_required %in% installed.packages()[,"Package"])] # filter out packages that are already installed; the remaining names are the packages still to be installed
if(length(new_req_packages)) install.packages(new_req_packages) # install any packages that are not yet installed
for(i in 1:length(packages_required)) {
library(packages_required[i], character.only = T) # loading all packages named in packages_required
}
# The message below is produced by a separate check on the loaded packages, not by the loop above.
[1] "All required packages for Assignment-2 are installed and loaded properly."
Introduction
Rationale for CRS
According to the USGS website, maps in the USA commonly use a projection from the State Plane Coordinate System (United States Geological Survey, 2023). It is a plane coordinate system in which each state is divided into up to six zones, depending on the geometry of the state (United States Geological Survey, 2023). These zones use a Transverse Mercator or Lambert conformal conic projection, which preserves shape and scale over smaller areas such as ZCTAs or tracts.
The map of USA State Plane Zones NAD83 on the ArcGIS website (ArcGIS Hub, 2020) shows that the Bay Area falls in California Zone 3. We therefore use EPSG:2227 – NAD83 / California zone 3 (ftUS).
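As a minimal sketch of this re-projection (assuming the sf package and an already-loaded polygon layer; the object name `zips` is illustrative, not from the analysis below):

```r
library(sf)
# st_crs() reports the current CRS; st_transform() re-projects the geometries
zips_2227 <- st_transform(zips, crs = 2227)  # NAD83 / California zone 3 (ftUS)
st_crs(zips_2227)$epsg                       # should report 2227
```

Because EPSG:2227 is measured in US survey feet, any subsequent area or distance calculations on the re-projected layer are returned in feet rather than degrees.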
Description of Data
Airbnb listings data for San Francisco, Oakland and San Mateo are extracted from the Inside Airbnb website. The variables chosen from the ACS socio-economic metrics are Median Household Income (B19013_001E) and the number of people aged 18-64 who have a computer and a broadband internet subscription (B28005_011E). We also use the Bay Area ZIP Code Areas.
As additional datasets, we use the number of light-duty vehicles per ZIP code (2020) and the average commercial electricity utility rate per ZIP code (2020), covering both investor-owned and non-investor-owned utilities.
Note: An API key for Census data can be obtained by signing up at http://api.census.gov/data/key_signup.html. We then use the tidycensus package to retrieve ACS data from the US Census Bureau.
census_api_key(J_API_KEY , install = TRUE, overwrite = TRUE) # Please replace 'J_API_KEY' with API Key obtained from census bureau
# The message below is produced by a separate check on the CENSUS_API_KEY environment variable, not by the line above.
[1] "API key to get Census data is successfully set up."
Data Extraction and Pre-Processing
We make two separate API calls to retrieve Median Household Income (B19013_001E) and the number of people aged 18-64 who have a computer and a broadband internet subscription (B28005_011E) from the ACS 2016-2020. This is more efficient and straightforward than a single API call with two variables, which returns a spatial or non-spatial file with a long-format attribute table that would then need to be reshaped into wide format for analysis.
We extract values of both variables for the entire US and later filter the ZIP codes specific to the Bay Area. This is because some California ZCTAs extend into other states, which causes errors when extracting ZCTA-level data for a single state ('zip - Determining which US zipcodes map to more than one state or more than one city? - Geographic Information Systems', 2021). According to the US Census Bureau, all ZCTA codes are valid US ZIP codes (United States Census Bureau, 2023).
median_hh_income <- get_acs( # get_acs retrieves ACS survey data from the US Census Bureau
geography = "zcta", # geographic unit for the variable: ZCTA, tract, county, state, etc.
variables = "B19013_001E", # MEDIAN HOUSEHOLD INCOME; "B19013_001E" is the ACS variable name
year = 2020, # census data collected over 2016-2020
geometry = FALSE # geometry comes from the Berkeley Library Bay Area ZIP code shapefile, not the ACS
) %>%
rename(estimate_MedianHhI = estimate) %>% # rename the retrieved column (estimate) to estimate_MedianHhI
select(GEOID, estimate_MedianHhI) # keep only GEOID (the ZCTA code, i.e. ZIP code) and the estimate, ignoring the MOE
# Retrieving Internet Use Variable
internet_use <- get_acs( # get_acs retrieves ACS survey data from the US Census Bureau
geography = "zcta", # geographic unit for the variable: ZCTA, tract, county, state, etc.
variables = "B28005_011E", # No. of people aged 18-64 with a computer and broadband internet subscription; "B28005_011E" is the ACS variable name
year = 2020, # census data collected over 2016-2020
geometry = FALSE # geometry comes from the Berkeley Library Bay Area ZIP code shapefile, not the ACS
) %>%
rename(estimate_InternetUse = estimate) %>% # rename the retrieved column (estimate) to estimate_InternetUse
select(GEOID, estimate_InternetUse) # keep only GEOID (the ZCTA code, i.e. ZIP code) and the estimate, ignoring the MOE
us_zip <- left_join(median_hh_income, internet_use, by = "GEOID") # join the two datasets by GEOID to get the dataset in the required format
head(us_zip) # show the first six rows of the dataset
# A tibble: 6 × 3
GEOID estimate_MedianHhI estimate_InternetUse
<chr> <dbl> <dbl>
1 00601 14398 6839
2 00602 16771 10955
3 00603 15786 21193
4 00606 14980 1443
5 00610 20167 7742
6 00612 19402 27116
We downloaded the Bay Area listings data (San Francisco, Oakland and San Mateo) from the Inside Airbnb website and the Bay Area ZIP Code Areas from the Berkeley Library. The files are loaded with the code below.
# Loading Airbnb listings downloaded from the Inside Airbnb website (San Francisco, Oakland and San Mateo)
listings_oakland <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/Oakland/listings.csv")
listings_sanfrancisco <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/SanFrancisco/listings.csv")
listings_sanmateo <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/SanMateo/listings.csv")
# Row-bind the three datasets to produce the Bay Area Airbnb listings
listings_bayarea <- rbind(listings_sanfrancisco, listings_sanmateo, listings_oakland)
# Loading the Bay Area ZCTA / ZIP code areas (geometry), downloaded from the Berkeley Library
bayarea_zipcodes <- read_sf("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/bayarea_zipcodes/bayarea_zipcodes.shp")
Two Additional Datasets:
# We use the commercial electricity rate dataset for both investor-owned and non-investor-owned utilities per ZIP code
iou_data <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/iou_zipcodes_2020.csv")
non_iou_data <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/non_iou_zipcodes_2020.csv")
# We use the number-of-vehicles-per-ZIP-code dataset
vehicle_count_2020 <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/vehicle-count-2020.csv")
non_iou_data_c <- non_iou_data %>% # data cleaning for the additional datasets
select(zip, comm_rate) %>% # keep only zip and the commercial rate for non-investor-owned utilities
group_by(zip) %>% # group by ZIP
summarise(
Avg_Non_IOU_Comm_Rate = mean(comm_rate, na.rm = TRUE) # mean across the non-investor-owned utilities in each ZIP
)
iou_data_c <- iou_data %>%
select(zip, comm_rate) %>% # keep only zip and the commercial rate for investor-owned utilities
group_by(zip) %>% # group by ZIP
summarise(
Avg_IOU_Comm_Rate = mean(comm_rate, na.rm = TRUE) # mean across the investor-owned utilities in each ZIP
)
electric_u <- non_iou_data_c %>%
full_join(iou_data_c, by = "zip")%>% # Join both Investor-owned and Non-Investor utility datasets by zipcode
mutate(
Avg_IOU_Comm_Rate = replace_na(Avg_IOU_Comm_Rate, 0), # replacing any NA values with 0
Avg_Non_IOU_Comm_Rate = replace_na(Avg_Non_IOU_Comm_Rate, 0 ), # replacing any NA values with 0
Combined_Avg_Comm_Rate = (Avg_Non_IOU_Comm_Rate + Avg_IOU_Comm_Rate) / 2 # average of the two rates as the combined commercial electricity rate
)%>%
select(zip, Combined_Avg_Comm_Rate) %>% rename(ZIP = zip) %>% mutate(ZIP = as.character(ZIP)) # keep only ZIP and the new statistic, dropping the rest
vc_califorina <- vehicle_count_2020 %>%
select(-Date, -Model.Year, -Fuel, -Make) %>% # keep only the Zip.Code, Vehicles and Duty columns
filter(Duty == "Light" & Zip.Code != "OOS") %>% # keep only light-duty vehicles and drop records with Zip.Code "OOS"
group_by(Zip.Code) %>% # group by ZIP code
summarise(No_of_Light_Duty_Vehicles = sum(Vehicles)) %>% rename(ZIP = Zip.Code) # sum light-duty vehicles per ZIP, then rename the column
Methodology
MAP 1: Count and Mean Price of AIRBNB Listings per ZIP Code Areas
The Bay Area Airbnb listings we downloaded contain longitude and latitude coordinates, so we initially set the CRS of the listings to EPSG:4326 (WGS84). We then re-project to EPSG:2227 – NAD83 / California zone 3 (ftUS).
We then perform a spatial join between the Bay Area ZIP Code Areas and the Airbnb listings, producing an overlay shapefile. This lets us examine the geographical distribution of Airbnb listings and their characteristics across the Bay Area ZIP Code Areas.
listings_bayarea_sf <- listings_bayarea %>%
st_as_sf(coords = c("longitude", "latitude")) %>% # creating point shapefile from coordinates
st_set_crs(4326) # setting CRS to geodetic system of earth (4326, WGS84) as they are long, lat coords
listings_bayarea_sf_transform <- st_transform(listings_bayarea_sf, crs = 2227) # Now, Changing CRS to mentioned Project CRS
overlay_listings <- st_join(bayarea_zipcodes, listings_bayarea_sf_transform) # Performing Spatial Join to create overlay Shapefile of Airbnb Listings and Bay area ZIP shapefile
st_crs(overlay_listings) == st_crs(listings_bayarea_sf_transform) # verifying the CRS of the new shapefile
[1] TRUE
We need to plot the number of Airbnbs and their mean price per ZIP code in the Bay Area. To derive these variables, we group by ZIP code, count the non-NA price rows of the listings in each ZIP code, and take the mean price of those listings.
listings_bayarea_zip <- overlay_listings %>% # creating new dataframe with no.of Airbnb's per ZIP and Mean Price per ZIP
group_by(ZIP) %>% # group by ZIP
summarise(
count_airbnb = sum(!is.na(price)), # counting non-NA price rows for listings
mean_price = mean(price, na.rm = TRUE) # calculate mean, ignoring NAs
)%>%
filter(count_airbnb != 0)
listings_bayarea_zip1 <- listings_bayarea_zip %>%
select(count_airbnb, everything()) # Changing the order of attributes for interactive map
listings_bayarea_zip2 <- listings_bayarea_zip %>%
select(mean_price, everything())%>% # Changing the order of attributes for interactive map
mutate(mean_price = round(mean_price, 1)) # round mean price to one decimal place
Summary of Airbnb's Count per Zipcode Variable (count_airbnb):
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 71.0 172.0 196.5 257.0 794.0
Summary of Mean Price per Zipcode Variable(mean_price):
Min. 1st Qu. Median Mean 3rd Qu. Max.
74.0 154.2 204.0 286.5 282.6 1818.0
Standard Deviation of Variable(count_airbnb): 162.5229
Standard Deviation of Variable(mean_price): 297.8739
Based on summary statistics for both the count and mean price of Airbnbs per ZIP code: the minimum of the count variable is 1 and the first quartile is 71, so the first class is set accordingly. The median of the count variable is 172 and the standard deviation is 162.5, which define the second and third classes. Finally, to flag outliers, given a third quartile of 257, a standard deviation of 162.5 and a maximum of 794, the last class has 500 as its lower boundary.
The minimum of the mean price variable is 74 and the first quartile is 154, so the first class is set accordingly. The median of the mean price is 204 and the standard deviation is 297.9, which define the second and third classes. Finally, to flag outliers, given a third quartile of 282.6, a standard deviation of 297.9 and a maximum of 1818, the last two classes take 1000 and 1800 as their lower boundaries respectively.
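The statistics motivating these fixed breaks can be reproduced directly, as a sketch using the listings_bayarea_zip object created above:

```r
summary(listings_bayarea_zip$count_airbnb)            # min, quartiles, mean, max of the count
sd(listings_bayarea_zip$count_airbnb)                 # spread of the count variable
summary(listings_bayarea_zip$mean_price)              # same statistics for mean price
sd(listings_bayarea_zip$mean_price)
breaks_count <- c(1, 72, 172, 359, 500, Inf)          # fixed breaks informed by the statistics above
breaks_price <- c(74, 155, 287, 501, 1000, 1800, Inf)
```

The two break vectors match the fixed-style breaks passed to tm_fill() in the mapping code that follows.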
New forms of spatial data such as Airbnb data can offer up-to-date snapshots of how people use space, leading to near-real-time insights. They generally exhibit high heteroskedasticity, but also provide detailed data at a granular level, down to individual Airbnbs. On the other hand, new forms of spatial data like Airbnb data never represent all market segments, and their collection can introduce measurement error because methodologies are less standardised.
Airbnb data are dynamic and reflect real-time changes in the market, like social media or GPS data; like those sources, they are user-generated. This is quite different from census-type data.
tmap_mode("view") # Interactive mode
map1.11 <- tm_shape(listings_bayarea_zip1) + # listings_bayarea_zip1 is listings_bayarea_zip re-arranged with count_airbnb as the first column
tm_fill( "count_airbnb",style="fixed", title = "No. of Airbnb's per ZIP", alpha = 1, breaks=c( 1, 72, 172, 359, 500, Inf), palette = "-viridis")+
tm_borders()+
tm_text("ZIP", size = 0.3)+
tm_view(set.view = c(center_lon, center_lat, zoom_level))
map1.12 <- tm_shape(listings_bayarea_zip2) + # listings_bayarea_zip2 is listings_bayarea_zip re-arranged with mean_price as the first column
tm_fill( "mean_price",style="fixed", title = "Mean Price of Airbnb per ZIP", alpha = 1, breaks=c( 74, 155, 287, 501,1000, 1800, Inf), palette = "-viridis")+
tm_borders()+
tm_text("ZIP", size = 0.3)+
tm_view(set.view = c(center_lon, center_lat, zoom_level))
tmap_arrange(map1.11, map1.12)
MAP 2: Median Household Income and No. of People aged 18-64 who have a Computer and Internet Subscription per ZIP Code Areas
We filter the Median Household Income and Internet Use variables to the Bay Area ZIP codes, then join the ACS dataset to the Bay Area ZIP Code Areas shapefile. As an additional step, we remove all ZIP codes where the Internet Use variable is missing.
bay_area_zipcodes <- bayarea_zipcodes$ZIP # Creating new var with Bayarea ZIP codes
bayarea_zip_vars_data <- us_zip %>% # Filtering ACS Data with Two variables, by Bay Area ZIP codes
filter(GEOID %in% bay_area_zipcodes)%>% # Using Bay Area ZIP codes variable, created above
rename(ZIP = GEOID) # renaming GEOID with ZIP
bayarea_zip_vars <- bayarea_zipcodes %>% left_join(bayarea_zip_vars_data, by = "ZIP") %>% filter(!(is.na(estimate_InternetUse))) # join the two ACS variables to the bayarea_zipcodes shapefile (downloaded from the Berkeley Library) and drop rows with NA values of the Internet Use variable
head(bayarea_zip_vars)
Simple feature collection with 6 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 5936840 ymin: 2232344 xmax: 6251037 ymax: 2506800
Projected CRS: NAD83 / California zone 3 (ftUS)
# A tibble: 6 × 8
ZIP PO_NAME STATE Area__ Length__ geometry
<chr> <chr> <chr> <dbl> <dbl> <MULTIPOLYGON [US_survey_foot]>
1 94558 NAPA CA 12313263537 995176. (((6102909 2377436, 6102846 23769…
2 95620 DIXON CA 7236949521. 441860. (((6230747 2302747, 6219259 23030…
3 95476 SONOMA CA 3001414165. 311319. (((6013411 2248863, 6013202 22488…
4 94559 NAPA CA 1194301745. 359105. (((6045939 2248058, 6044555 22480…
5 94533 FAIRFIELD CA 991786103. 200773. (((6146297 2299595, 6146371 22987…
6 94954 PETALUMA CA 2006544443. 267474. (((5998504 2235043, 5991182 22339…
# ℹ 2 more variables: estimate_MedianHhI <dbl>, estimate_InternetUse <dbl>
We remove all ZIP codes where the Median Household Income variable is missing.
bayarea_zip_vars1 <- bayarea_zip_vars %>%
filter(!(is.na(estimate_MedianHhI))) # remove rows of bayarea_zip_vars where 'estimate_MedianHhI' is NA
Summary of Median Household Income per Zipcode (estimate_MedianHhI):
Min. 1st Qu. Median Mean 3rd Qu. Max.
40813 89583 111037 122296 153256 250001
Summary of No. of People aged 18-64 who have a Computer and Internet Subscription
per Zipcode (estimate_InternetUse):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 8110 16235 17830 24914 59638
Standard Deviation of Variable(estimate_MedianHhI): 45357
Standard Deviation of Variable(estimate_InternetUse): 12765.53
Based on the summary statistics for both ACS variables per ZIP code, the range and standard deviation are large for both, so we use the Fisher-Jenks classification method, which suits highly skewed data.
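tmap computes these classes internally when style = "fisher"; the same breaks can be inspected with the classInt package (loaded at the start), as a sketch on the income variable:

```r
library(classInt)
# Fisher-Jenks minimises within-class variance, so class boundaries follow
# natural groupings in a skewed distribution rather than equal-width bins
fj_income <- classIntervals(bayarea_zip_vars1$estimate_MedianHhI, n = 5, style = "fisher")
fj_income$brks  # the five class boundaries used for the choropleth
```

Printing the classIntervals object also reports how many ZIP codes fall into each class, which is a quick check that no class is nearly empty.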
The Bay Area ZIP Code Areas represent a mixed-use neighborhood typology. Most neighborhoods have medium household incomes and high internet subscription counts. Socio-economic indicators such as Median Household Income and internet use help delineate this typology. San Francisco and Oakland have more affluent neighborhoods, with high internet usage and medium household incomes, whereas San Mateo has high-income neighborhoods with medium internet usage.
Given this classification, a reasonable hypothesis is that Airbnbs would cluster in San Francisco and Oakland, owing to high internet usage and medium household incomes (middle- to high-income residential areas). Airbnb listings would also cluster in areas with low commercial electricity rates and high numbers of vehicles.
tmap_mode("plot")
map2.1 <- tm_basemap() +
tm_shape(bayarea_zip_vars1) + # shapefile for map 2
tm_fill(col = "estimate_MedianHhI", palette = "YlGn", n = 5, style = "fisher", title = "Median Household Income (in $ USD)") + # estimate_MedianHhI
tm_compass(position = c(0.85, 0.85)) + # Adding Compass
tm_scale_bar(position = c(0.625, 0.02)) + # Adding Scalebar
tm_borders(lwd = 0.3)+ # Borders for zcta's polygons
tm_layout(
legend.title.size = 0.715,
legend.text.size = 0.6,
inner.margins = c(0.22, 0.13, 0.02, 0.08),
legend.position = c(0.05, 0.02),
legend.width = 0.5,
bg.color = "#eaf5fa",
main.title = "Median Household Income per ZIP Code", # Add the map title within tm_layout
main.title.size = 0.7,
main.title.position = "center"
)
map2.2 <- tm_basemap() +
tm_shape(bayarea_zip_vars) +
tm_fill(col = "estimate_InternetUse", palette = "YlGn", n = 5, style = "fisher", title = "Broadband Internet Connection (People)") +
tm_compass(position = c(0.85, 0.85)) +
tm_scale_bar(position = c(0.625, 0.02)) +
tm_borders(lwd = 0.3)+
tm_layout(
legend.title.size = 0.715,
legend.text.size = 0.6,
inner.margins = c(0.185, 0.13, 0.02, 0.08),
legend.position = c(0.05, 0.02),
legend.width = 0.5,
bg.color = "#eaf5fa",
main.title = "No. of People aged (18-64) using Broadband Internet per ZIP Code", # Add the map title within tm_layout
main.title.size = 0.7,
main.title.position = "center"
)
tmap_arrange(map2.1, map2.2)
MAP 3: AIRBNB Listings and Internet Use Variable (Combining Datasets)
We change the order of attributes in Bay Area ZIP Codes as following:
bayarea_zip_vars_map3 <- bayarea_zip_vars %>%
select(estimate_InternetUse, everything()) # change the order of attributes for the interactive map
We calculate the natural logarithm of the price of the Airbnb listings in the overlay shapefile and check the CRS of both shapefiles:
listings_bayarea_sf_transform_copy <- listings_bayarea_sf_transform
listings_bayarea_sf_transform_copy$nl_price <- log(listings_bayarea_sf_transform_copy$price) # calculating natural logarithm of price ( log to base e)
st_crs(bayarea_zip_vars_map3) == st_crs(listings_bayarea_sf_transform_copy) # checking the CRS of both shapefiles
[1] TRUE
We change the order of attributes in the overlay listings shapefile as follows:
listings_bayarea_sf_transform_copy <- listings_bayarea_sf_transform_copy %>%
select(nl_price, everything()) %>% mutate(nl_price = round(nl_price, 1)) # round nl_price to one decimal place
Based on the summary statistics for internet usage (people) per ZIP code and the natural logarithm of Airbnb price, the range and standard deviation are large, so we use the Fisher-Jenks classification method, which suits highly skewed data.
Airbnb listings are mainly clustered in San Francisco, Daly City, South San Francisco, Burlingame, Foster City, Redwood City and Oakland (geographical observations from the map below). High values of the natural logarithm of price are found in San Francisco, the west-coast seaside and Redwood City (San Carlos).
In San Francisco, ZIP code areas with high internet usage appear to have many Airbnb listings. ZIP code areas with around 5,500 people using the internet have high prices for Airbnb listings.
tmap_mode("view")
map3 <- tm_shape(bayarea_zip_vars_map3) +
tm_fill(col = "estimate_InternetUse", alpha = 1, palette = "YlGn", n = 5, style = "fisher",title = "Broadband Internet Connection(People)") +
tm_borders() +
tm_text("ZIP", size = 0.3)+
tm_shape(listings_bayarea_sf_transform_copy) +
tm_dots(col = "nl_price", palette = "Reds", n = 3, style = "fisher", title = "Natural Logarithm of Airbnb's Price") +
tm_view(set.view = c(center_lon, center_lat, zoom_level)) # set the initial view for the interactive map
map3
MAP 4: Spatial Auto-Correlation
Calculating the neighbors list and the queen-contiguity spatial weights matrix for the Bay Area ZIP code areas that have Median Household Income values:
nb_q1 <- poly2nb(bayarea_zip_vars1, queen = TRUE) # shapefile used, have no NA values of Median Household Income variable
w_queen1 <- nb2listw(nb_q1, style = "B", zero.policy=TRUE) # queen-contiguity based Weights for list of neighbors for all observations
isolates1 <- which(w_queen1$neighbours == "0")
bayarea_zip_vars1_NoZero <- bayarea_zip_vars1[-c(isolates1),] # shapefile with no observations that have zero neighbors under queen contiguity
Calculating the neighbors list and the queen-contiguity spatial weights matrix for the Bay Area ZIP code areas that have Internet Usage values:
nb_q2 <- poly2nb(bayarea_zip_vars, queen = TRUE) # shapefile used, have no NA values of Broadband Internet Use
w_queen2 <- nb2listw(nb_q2, style = "B", zero.policy=TRUE)
isolates2 <- which(w_queen2$neighbours == "0")
bayarea_zip_vars_NoZero <- bayarea_zip_vars[-c(isolates2),] # shapefile with no observations that have zero neighbors under queen contiguity
Global Spatial Auto-Correlation:
Calculating list of neighbors(Neighborhoods) and Queen-based Standardized Spatial Weights Matrix for Bay Area ZIP Code areas which have Median Household Income or/and Internet Usage(People) :
nb_q1 <- poly2nb(bayarea_zip_vars1_NoZero, queen = TRUE) # construct the neighbors list after filtering out observations with zero neighbors and NA values of Median Household Income
w_queen_std1 <- nb2listw(nb_q1, style = "W") # spatial weights matrix using queen contiguity and row-standardised weights
nb_q2 <- poly2nb(bayarea_zip_vars_NoZero, queen = TRUE) # construct the neighbors list after filtering out observations with zero neighbors and NA values of Broadband Internet Use
w_queen_std2 <- nb2listw(nb_q2, style = "W") # spatial weights matrix using queen contiguity and row-standardised weights
Calculating the spatial lag of Median Household Income and Internet Usage (People):
bayarea_zip_vars1_NoZero$sl_MedianHhI <- lag.listw(w_queen_std1, bayarea_zip_vars1_NoZero$estimate_MedianHhI) # calculating spatial lag for Medain Household Income Variable in Bay Area
bayarea_zip_vars_NoZero$sl_InternetUse <- lag.listw(w_queen_std2, bayarea_zip_vars_NoZero$estimate_InternetUse) # calculating the spatial lag of the Internet Use variable in the Bay Area
Moran's I for Median Household Income in Bay Area ZIP Code Areas:
moran.mc(bayarea_zip_vars1_NoZero$estimate_MedianHhI, w_queen_std1, nsim=1000, alternative="greater") # Moran's I statistic for Median Household Income Variable in Bay Area
Monte-Carlo simulation of Moran I
data: bayarea_zip_vars1_NoZero$estimate_MedianHhI
weights: w_queen_std1
number of simulations + 1: 1001
statistic = 0.39557, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater
Moran's I for Median Household Income in the Bay Area is 0.396 with a p-value of 0.000999, so the observed spatial pattern is statistically significant. The p-value is the probability of obtaining a Moran's I greater than the observed one under a random spatial distribution. If it is below 0.05, we reject the null hypothesis that the observed spatial pattern is due to random chance; the Monte-Carlo permutation approach plays a role analogous to a chi-square significance test. The spatial distribution of Median Household Income is more spatially concentrated than a random spatial distribution.
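As a sketch, the observed statistic can be reproduced by hand from the spatial lag, using the objects defined above (with row-standardised weights, the sum of weights S0 equals n, so Moran's I reduces to the ratio below):

```r
x <- bayarea_zip_vars1_NoZero$estimate_MedianHhI
z <- x - mean(x)                     # deviations from the mean
lag_z <- lag.listw(w_queen_std1, z)  # spatially lagged deviations of each ZIP's neighbors
sum(z * lag_z) / sum(z^2)            # Moran's I, matching the observed statistic reported above
```

A positive ratio means that ZIP codes with above-average income tend to neighbor other above-average ZIP codes, which is exactly what the global statistic summarises.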
Moran's I for Internet Usage (People) in Bay Area ZIP Code Areas:
moran.mc(bayarea_zip_vars_NoZero$estimate_InternetUse, w_queen_std2, nsim=1000, alternative="greater") # Moran's I statistic for Interent Use Variable in Bay Area
Monte-Carlo simulation of Moran I
data: bayarea_zip_vars_NoZero$estimate_InternetUse
weights: w_queen_std2
number of simulations + 1: 1001
statistic = 0.23313, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater
Moran's I for internet usage (people) in the Bay Area is 0.23313 with a p-value of 0.000999, so the observed spatial pattern is statistically significant. The spatial distribution of internet usage (people) is mildly more spatially concentrated than a random spatial distribution.
Local Spatial Auto-Correlation:
Calculating LISAs for the Bay Area ZIP Code Areas for both Median Household Income and Internet Usage (People):
lisa_perm1 <- localmoran_perm(bayarea_zip_vars1_NoZero$estimate_MedianHhI, w_queen_std1, nsim=1000, alternative="two.sided") # Lisa Clustering for Median Household Income in Bay Area
lisa_perm2 <- localmoran_perm(bayarea_zip_vars_NoZero$estimate_InternetUse, w_queen_std2, nsim=1000, alternative="two.sided") # LISA clustering for the Internet Use variable in the Bay Area
Calculating the quadrant of each observation's local Moran's I, with the significance cutoff at 0.2. As there are no quadrants at the 0.1 significance level, the cutoff was chosen to yield at least a minimal number of levels for at least one variable:
quadrants1 <- hotspot(lisa_perm1, Prname="Pr(z != E(Ii)) Sim", cutoff=0.2) # Creating quadrants from significance values of local Moran's I statistic calculated from Median Household Income
quadrants2 <- hotspot(lisa_perm2, Prname="Pr(z != E(Ii)) Sim", cutoff=0.2) # creating quadrants from the significance values of the local Moran's I for the Internet Use variable
Levels of Quadrant 1 :
Low-Low, High-Low, Low-High, High-High
Levels of Quadrant 2 :
Low-Low, High-Low, Low-High, High-High
Adding Quadrants back to Bay Area ZIP Code Areas Shapefile:
bayarea_zip_vars1_NoZero$quadrant1 <- as.character(quadrants1) %>% replace_na("Not significant") # Adding created quadrants to shapefile(no NA values for Median Household Income) where no observation with zero neighbours based on queen-contiguity
bayarea_zip_vars_NoZero$quadrant2 <- as.character(quadrants2) %>% replace_na("Not significant") # adding the created quadrants to the shapefile (no NA values for the Internet Use variable, no observations with zero neighbors under queen contiguity)
borders1 <- tm_shape(bayarea_zip_vars1_NoZero) +
tm_fill() +
tm_borders(col = "black", lwd = 0.2)
borders2 <- tm_shape(bayarea_zip_vars_NoZero) +
tm_fill() +
tm_borders(col = "black", lwd = 0.2)
hh1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "High-High") # filtering only the 'High-High' quadrant to map
hh_map1 <- tm_shape(hh1) +
tm_fill(col = "royalblue2", alpha=0.8)+
tm_borders(col = "black", lwd = 0.3)
ll1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "Low-Low") # filtering only the 'Low-Low' quadrant to map
ll_map1 <- tm_shape(ll1) +
tm_fill(col = "red2", alpha=0.8)+
tm_borders(col = "black", lwd = 0.3)
lh1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "Low-High") # filtering only 'Low-High' quadrant to map
lh_map1 <- tm_shape(lh1) +
tm_fill(col = "gold", alpha=0.8)+
tm_borders(col = "black", lwd = 0.3)
hl1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "High-Low") # filtering only 'High-Low' quadrant to map
hl_map1 <- tm_shape(hl1) +
tm_fill(col = "darkgreen", alpha=0.8)+
tm_borders(col = "black", lwd = 0.3)
ns1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "Not significant") # filtering only 'Not significant' quadrant to map
ns_map1 <- tm_shape(ns1) +
tm_fill(col = "lightgrey", alpha=0.8)+
tm_borders(col = "black", lwd = 0.3)
hh2 <- bayarea_zip_vars_NoZero %>% dplyr::filter(quadrant2 == "High-High") # filtering only 'High-High' quadrant to map
hh_map2 <- tm_shape(hh2) +
tm_fill(col = "royalblue2", alpha=0.8)+
tm_borders(col = "black", lwd = 0.3)
ll2 <- bayarea_zip_vars_NoZero %>% dplyr::filter(quadrant2 == "Low-Low") # filtering only 'Low-Low' quadrant to map
ll_map2 <- tm_shape(ll2) +
tm_fill(col = "red2", alpha=0.8)+
tm_borders(col = "black", lwd = 0.3)
hl2 <- bayarea_zip_vars_NoZero %>% dplyr::filter(quadrant2 == "High-Low") # filtering only 'High-Low' quadrant to map
hl_map2 <- tm_shape(hl2) +
tm_fill(col = "darkgreen", alpha=0.8)+
tm_borders(col = "black", lwd = 0.3)
ns2 <- bayarea_zip_vars_NoZero %>% dplyr::filter(quadrant2 == "Not significant") # filtering only the 'Not significant' quadrant to map
ns_map2 <- tm_shape(ns2) +
tm_fill(col = "lightgrey", alpha=0.8)+
tm_borders(col = "black", lwd = 0.3)
As there were no quadrants at a cutoff of 0.1, we set the cutoff at 0.2 to obtain a minimum number of quadrants for at least one variable. These LISA clusters are therefore likely to have arisen by random chance: the clustering observed at the whole-Bay-Area level has spatial structure, whereas the locally observed LISA clusters are not statistically significant.
Even so, it is hard to dismiss the observed spatial pattern in the LISA clusters as pure chance; this calls for further and more advanced exploratory analysis.
tmap_mode("plot")
lisa_map_cluster1 <- borders1 +
hh_map1 + ll_map1 + hl_map1 +lh_map1 + ns_map1 + # combining all quadrant maps
tm_compass(position = c(0.85, 0.85)) +
tm_scale_bar(position = c(0.625, 0.02)) +
tm_add_legend(type = "fill", col = c("royalblue2", "red2", "darkgreen", "gold", "lightgrey"),
labels = c("High-High", "Low-Low", "High-Low", "Low-High", "Not significant"), title = "LISA cluster(Median Household Income)") +
tm_layout(
legend.title.size = 0.715,
legend.text.size = 0.6,
inner.margins = c(0.22, 0.13, 0.02, 0.08),
legend.position = c(0.05, 0.02),
legend.width = 0.5,
bg.color = "#eaf5fa",
main.title = "LISA Clusters for Median Household Income Variable",
main.title.size = 0.7,
main.title.position = "center"
)
lisa_map_cluster2 <- borders2 +
hh_map2 + ll_map2 + hl_map2 + ns_map2 + # Combining All quadrant maps
tm_compass(position = c(0.85, 0.85)) +
tm_scale_bar(position = c(0.625, 0.02)) +
tm_add_legend(type = "fill", col = c("royalblue2", "red2", "darkgreen", "lightgrey"),
labels = c("High-High", "Low-Low", "High-Low", "Not significant"), title = "LISA cluster(Broadband Internet Use)") +
tm_layout(
legend.title.size = 0.715,
legend.text.size = 0.6,
inner.margins = c(0.185, 0.13, 0.02, 0.08),
legend.position = c(0.05, 0.02),
legend.width = 0.5,
bg.color = "#eaf5fa",
main.title = "LISA Clusters for Broadband Internet Use Variable",
main.title.size = 0.7,
main.title.position = "center"
)
tmap_arrange(lisa_map_cluster1, lisa_map_cluster2) # display both LISA maps side by side
Geo-Demographic Classification in Bay Area
We create an overlay shapefile by spatially joining the Bay Area ZIP Code Areas with the Airbnb listings, then group by ZIP code to compute the summary variables. We also join the average commercial electricity rate and the number of light-duty vehicles per ZIP code.
overlay_listings <- st_join(bayarea_zip_vars, listings_bayarea_sf_transform, join = st_intersects) # spatial join to create Overlay Shapefiles
overlay_listings_zip <- overlay_listings %>%
group_by(ZIP) %>%
summarise(
count_airbnb = sum(!is.na(price)), # count non-NA price rows for count
mean_price = mean(price, na.rm = TRUE), # Mean price
mean_review = mean(number_of_reviews, na.rm = TRUE), # No. of Reviews
mean_min_nights = mean(minimum_nights, na.rm = TRUE), # No. of Minimum Nights stay
Median_HhI = mean(estimate_MedianHhI, na.rm = TRUE), # Median Household income
Internet_Use = mean(estimate_InternetUse, na.rm = TRUE) # Internet Usage
)%>%
filter(count_airbnb != 0)
overlay_listings_zip <- overlay_listings_zip %>% left_join(vc_califorina, by = "ZIP") # Join overlay listings with Light-Duty Vehicles count
overlay_listings_zip <- overlay_listings_zip %>% left_join(electric_u, by = "ZIP") # joining overlay listings with the commercial electricity rate
overlay_listings_zip <- na.omit(overlay_listings_zip) # removing NA values
We create an 8-tuple vector of variables for the multivariate classification:
airbnb_ra <- c('count_airbnb', 'mean_price', 'mean_review', 'mean_min_nights', 'Median_HhI', 'Internet_Use', 'No_of_Light_Duty_Vehicles', 'Combined_Avg_Comm_Rate') # geodemographic attributes (8-tuple) for the multivariate classification
Subsetting the overlay listings shapefile without the geometry attribute:
overlay_listings_zip_NoGeo <- st_drop_geometry(overlay_listings_zip[, airbnb_ra, drop=FALSE]) # we are dropping geometry col in shp
overlay_listings_zip_NoGeo <- overlay_listings_zip_NoGeo[complete.cases(overlay_listings_zip_NoGeo), ] # removing NA values
K-Means Clustering:
set.seed(12345) # setting seed
k6cls <- kmeans(overlay_listings_zip_NoGeo, centers=6, iter.max = 1000) # k-means clustering - 6 clusters (each observation is an 8-tuple)
overlay_listings_zip$k6cls <- k6cls$cluster # cluster label assigned back to overlay listings
k6cls$centers # centres of the 6 clusters: six 8-tuple vectors, each defining the centre of one cluster
  count_airbnb mean_price mean_review mean_min_nights Median_HhI Internet_Use
1 279.0000 179.9462 53.98208 6.132616 106352.00 42383.00
2 197.1875 278.3737 54.32059 15.048195 171773.62 17882.00
3 48.0000 327.6437 35.09278 14.406388 242917.50 6848.00
4 179.5385 438.6613 30.89614 15.389951 57687.23 15399.62
5 202.0000 306.7908 59.36053 13.594888 128579.62 20916.00
6 256.8235 174.3755 51.10254 14.275260 96413.88 26327.35
No_of_Light_Duty_Vehicles Combined_Avg_Comm_Rate
1 126455.00 0.09763171
2 19991.88 0.09186155
3 10437.00 0.13166995
4 14125.69 0.10183107
5 21415.62 0.09796353
6 25233.29 0.11473432
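A caveat worth noting: the eight variables sit on very different scales (the electricity rate is around 0.1, while income and vehicle counts reach the tens and hundreds of thousands), so unstandardised k-means distances are dominated by the large-magnitude variables. A hedged sketch of re-running the clustering on z-scores, so every variable contributes equally, using the same objects as above:

```r
# Centre each variable and divide by its standard deviation (z-scores)
overlay_scaled <- scale(overlay_listings_zip_NoGeo)

set.seed(12345)
k6cls_z <- kmeans(overlay_scaled, centers = 6, iter.max = 1000)

# Report the centres on the original scale for interpretability
aggregate(overlay_listings_zip_NoGeo,
          by = list(cluster = k6cls_z$cluster), FUN = mean)
```

The resulting typology may differ from the unscaled one, since income and vehicle counts no longer dominate the distance calculation.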
By observing the centers of the different clusters, the main neighborhood types can be identified as tourist-focused areas, residential areas, economically diverse areas and vehicle-dependent areas. Tourist-focused areas show a high listing count, high internet usage and low median household income. Residential areas show high median household income, high internet usage and a high number of light-duty vehicles. Economically diverse areas show medium median household income (Median_HhI) and internet usage (Internet_Use). Vehicle-dependent areas show a high vehicle count.
The characteristics that help delineate the typology are the Airbnb metrics ('count_airbnb', 'mean_price', 'mean_review', 'mean_min_nights'), median household income, internet usage, the number of light-duty vehicles and the average commercial electricity rate.
Using this classification, policymakers can target the areas most in need. Economically weak neighborhoods, identified by low median household income and low internet usage, can be prioritised for development programs and infrastructure improvements. To manage tourist activity in neighborhoods with a high Airbnb count, urban planning committees can focus on minimising the impact of tourism on local communities, for example through affordable housing. For transportation and urban planning, neighborhoods with a high vehicle count might benefit from improved public transportation and the development of local amenities to reduce travel. Community services such as digital hubs might improve overall livelihoods in neighborhoods with low internet usage and varied median household income.
tmap_mode("plot")
map_cluster1 <- tm_basemap() +
tm_shape(overlay_listings_zip) +
tm_fill(col = "k6cls", palette = viridis(256), style = "cont", title = " Type of Cluster") +
tm_compass(position = c(0.85, 0.88)) +
tm_scale_bar(position = c(0.58, 0.02)) +
tm_borders(col = "white", lwd = 0.2)+
tm_layout(
legend.title.size = 0.7,
legend.text.size = 0.6,
inner.margins = c(0.2, 0.13, 0.02, 0.08),
legend.position = c(0.05, 0.02),
legend.width = 0.5,
bg.color = "#eaf5fa",
main.title = "Type of Clusters in Bay Area", # Add the map title within tm_layout
main.title.size = 0.7,
main.title.position = "center"
)
map_veh <- tm_basemap() +
tm_shape(overlay_listings_zip) +
tm_fill(col = "No_of_Light_Duty_Vehicles", palette = "YlGn", n = 5, style = "fisher", title = "Vehicles per ZIP") +
tm_compass(position = c(0.85, 0.88)) +
tm_scale_bar(position = c(0.58, 0.02)) +
tm_borders(col = "black", lwd = 0.2)+
tm_layout(
legend.title.size = 0.7,
legend.text.size = 0.6,
inner.margins = c(0.185, 0.13, 0.02, 0.08),
legend.position = c(0.05, 0.02),
legend.width = 0.5,
bg.color = "#eaf5fa",
main.title = "No. of Vehicles(Light-Duty) in Bay Area", # Add the map title within tm_layout
main.title.size = 0.7,
main.title.position = "center"
)
map_electric <- tm_basemap() +
tm_shape(overlay_listings_zip) +
tm_fill(col = "Combined_Avg_Comm_Rate", palette = "YlGn", n = 5, style = "fisher", title = "Commercial Electricity Rate") +
tm_compass(position = c(0.85, 0.88)) +
tm_scale_bar(position = c(0.58, 0.02)) +
tm_borders(col = "black", lwd = 0.2)+
tm_layout(
legend.title.size = 0.7,
legend.text.size = 0.6,
inner.margins = c(0.185, 0.13, 0.02, 0.08),
legend.position = c(0.05, 0.02),
legend.width = 0.5,
bg.color = "#eaf5fa",
main.title = "Commercial Electricity Rate in Bay Area", # Add the map title within tm_layout
main.title.size = 0.7,
main.title.position = "center"
)
tmap_arrange(map_cluster1, map_veh, map_electric, nrow = 1, ncol = 3) # arrange the three maps in one row
Conclusion
Geodemographic classification allows data-informed decisions for policymaking and societal improvement. In reality the picture is much more nuanced: we need an understanding of different neighborhood dynamics, real-estate markets and other socio-economic factors. But by targeting areas based on neighborhood characteristics, policymakers can effectively address local issues and enhance overall community well-being.
Although a multivariate clustering was performed, we neglected the temporal dimension, which introduces temporal inconsistencies. Biases can also be introduced by new forms of spatial data such as the Airbnb listings. Beyond the insights of the geodemographic classification above, policymakers therefore need more sophisticated approaches and real-time verification procedures to make well-informed decisions.
References
GIS Geography (2023) Choropleth Maps - A Guide to Data Classification. [online] Available at: https://gisgeography.com/choropleth-maps-data-classification/ (Accessed: 08 January 2024).
Ross, S. (2009) Introduction to Probability and Statistics for Engineers and Scientists. 4th edn. London: Elsevier Academic Press.
United States Geological Survey (2023) What is the State Plane Coordinate System? Can GPS provide coordinates in these values?. Available at: https://www.usgs.gov/faqs/what-state-plane-coordinate-system-can-gps-provide-coordinates-these-values#:~:text=The%20State%20Plane%20Coordinate%20System%20(SPCS)%2C%20which%20is%20only,the%20state%27s%20size%20and%20shape. (Accessed: 08 January 2024).
United States Census Bureau (2023) ZIP Code Tabulation Areas (ZCTAs). Available at: https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html (Accessed: 08 January 2024).
ArcGIS Hub (2020) USA State Plane Zones NAD83. Available at: https://hub.arcgis.com/datasets/esri::usa-state-plane-zones-nad83/about (Accessed: 08 January 2024).
‘zip-Determining which US zipcodes map to more than one state or more than one city?-Geographic Information Systems’ (2021) Geographic Information Systems Stack Exchange. Available at: https://gis.stackexchange.com/questions/53918/determining-which-us-zipcodes-map-to-more-than-one-state-or-more-than-one-city (Accessed: 08 January 2024).
UC Berkeley GeoData Repository (2004) Bay Area ZIP Code Areas [Dataset]. Available at: https://geodata.lib.berkeley.edu/catalog/ark28722-s7888q (Accessed: 08 January 2024).
California Open Data (2023) Vehicle Fuel Type Count by Zip Code [Dataset]. Available at: https://data.ca.gov/dataset/vehicle-fuel-type-count-by-zip-code (Accessed: 08 January 2024).
National Renewable Energy Laboratory (2022) U.S. Electric Utility Companies and Rates: Look-up by Zipcode (2020) [Dataset]. Available at: https://catalog.data.gov/dataset/u-s-electric-utility-companies-and-rates-look-up-by-zipcode-2020 (Accessed: 08 January 2024).